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^ ' Abstract. Particle smoothers are widely used algorithms allowing to approximate the smooth- 

■^ I ing distribution in hidden Markov models. Existing algorithms often suffer from slow compu- 

CN ■ tational time or degeneracy We propose in this paper a way to improve any of them with 

J^ [ a linear complexity in the number of particles. When iteratively applied to the degenerated 

Filter-Smoother, this method leads to an algorithm which turns out to outperform existing lin- 
C^^ ■ ear particle smoothers for a fixed computational time. Moreover, the associated approximation 

O I satisfies a central limit theorem with a close-to-optimal asymptotic variance, which be easily 

estimated by only one run of the algorithm. 

Keywords: Degeneracy, Hidden Markov model, Particle smoothing. Sequential Monte- 
•^ . Carlo, Variance estimation 

5t 1- Introduction 

D 

A hidden Markov model (HMM) is a doubly stochastic process where a Markov chain 
{Xt}'^Q is only partially observed through a sequence of observations {Yt}f^Q. More pre- 
cisely, let X and Y be two spaces equipped with countably generated cr-fields X and y, 
respectively, and denote by Af a Markovian transition kernel on (X, X) and by G a tran- 
sition kernel from (X, A") to (Y,^). In our setting, the dynamics of the bivariate process 
{(Xk,Yk)}^Q follows the Markovian transition kernel 

P[{x,y),A]'':^'M^G[{x,y),A]= f f M{x,dx')G{x' ,dy')lA{x' ,y') , (1) 



where (a;, y) G X x Y and Ae X (g)y. 
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We assume that there exist nonnegative cr-finite measures A on (X, X) and /i on (Y, y) 
such that for any x S X, M(x, •) and G{x, •) are dominated by A and fi, respectively. This 
imphes the existence of kernel densities 



, dof dM{x, ■) , 
m{x,x) = — — — {x ) 



and 



, . dcf dG(x, •) 



In what follows, we simply write dx for A(dx). 

We are interested here in estimating the expectation of a function of {Xq, . . . , Xt) condi- 
tionally on the observations Yq, . . . ,Yt using particle smoothing algorithms. Many different 
implementations of the particle filters and smoothe rs have been proposed in the literatur e 
with different comput a tiona l costs; see for example IPel Morall ( 20041 ) : ICappe et al.l ( 20051 ): 
Doucet and JohansenI ( 20091 ) . So far, the existing particle smoothers rely on the so-called 
Forward- Filter whose complexity is linear in the number of particles N. In its simplest ex- 
tension, storing the pa ths of the Forwa rd-Filter allows to approximate the joint smoothing 
distribution as seen bv lKitaeawal ( 19961 ). This method known as the Filter- Smoother unfor- 
tunately suffers from a poor representation of the states corresponding to times t <^T . To 
circumvent th is drawback, the FF BS (Forward Filtering Backward Smoothing) algorithm 
introduced bv iDoucet et al.l ( 20001) adds a backward pass to the forward filter at the cost 
of a quadratic complexity when use d for approximating the marginal smoothing distribu- 
tions. However. I Godsill et al.l ( 2004 ) extended it to the FFBSi (Forward Filtering Backward 
Simulation), an algorith m which can be im plemented with a O (N) computational cost per 
time step as proposed bv lDouc et al.l ( 20101 ) when approximating the whole joint smoothing 
distribution. If we are interested on ly in approximation s of the marginal smoothing distri- 
butions, the Two-Filter smoother of iBriers et al.l ( 20101) may also be used as an alternative 
method. This algorithm originally suffers from a quadratic computational cost but has 
recently been modified in lFearnhead et al.l ( 20101) to get a linear one. 

Whereas more and more SMC-based smoothing algorithms are l inear in the number o f 
particles, there is a recen t surge of intere s t in m ixed strategies (see lAndrieu et al.l ( 20101) : 



Olsson and RvdenI (|2010[) or lChopin et al.l (|201ll )) where nice properties of SMC and MCMC 
algorithms are conjugated to produce better approximations. Whereas these methods are 
developed mostly in the framework of Bayesian inference for state space models, we focus 
here on the quality of the approximation of the smoothing distribution associated to a fixed 
Hidden Markov model. This is a crucial problem to address and the hope is to exhibit the key 
factors that affects the quality of the estimation. More precisely, fix (once and for all) a set of 
observations Yq, . . . ,Yt and try to approximate the law of Xq, . . . , Xt conditionally on the 
observations with a set of particles (^q , . . . , ^^ )fLi associated to equal or unequal weights 



■ 1 4t 
{uiT")iLi- For a fixed CPU time, how to build the best population of particles? Should we 



use mixed strategies? Can we obtain confidence intervals without additional Monte Carlo 

Since T is fixed, the 



passes? These are some of the questions we consider in this work. 



context of this work does not exactly correspond to the one of iGilks and Berzuinil (|200ll ) 
who propose to sequentially alternate SMC stages and MCMC stages as more and more 
observations are available. Nevertheless, the MCMC step called the Move stage by these 
authors is now included in the method proposed in this paper to form an efficient algorithm 
where some directional update of the components extends sequentially the diversity of the 
population from high values of t to lower values of t. Despite its simplicity, the resulting 
algorithm turns out to be more than a strong competitor to existing smoothing samplers. 
We propose here to improve any consistent particle approximation of the joint smoothing 
distribution by moving sequentially the particles according to a Metropolis-within-Gibbs 
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iteration. Such algorithm has a hnear computational cost and can be applied in particular 
to the Filter-Smoother to reduce the degeneracy without increasing the complexity. The 
paper is organized as follows: in Section [2l we describe the algorithm. In Section [3l we 
show that the limiting variance of the algorithm is reduced in comparison with the original 
SMC-based population with a multinomial resampling stage. One major characteristic of 
this algorithm is the fact that, by letting the number of iterations of the Markov chains 
proportional to In TV, the asymptotic variance is close to optimal and can be estimated using 
the evolution of only one population of particle paths. Up to our knowledge, this feature 
is totally new in the smoothing literature. Numerical experiments and comparisons with 
existing linear smoothers are provided in Section 2] for the Linear Gaussian Model (LGM) 
and the Stochastic Volatility Model (StoVolM). 

2. MH-lmprovement of a particle path population 

Denote for u < s, a^-.s = (om, a„-fi, . . . , Os) and define the smoothing distribution ng.Ti'p 
associated to a fixed set of observations Yq-t = yo-.r by: for any A e ;t'®(^+^\ 

A^f ! ■■■ Sx{'^xo)g{xo,yo) Y{^^.^m{x.i-i,Xi)g{xi,yi) lA(a;o:T)da;i:T 
no:T|T(A) = ■ 



/ ■ ■ ■ / x(da;o)5(a;o, J/o) Dj^i m{x^^i,Xi)g{xi, y^ 



dxi.,'1 



where x is a probability measure on (X, X). The distribution no:7'|7' is thus the law of Xq-t 
conditionally to Yq-t — yo-.T when Xq follows the distribution x- In the sequel, x is assumed 
to have a density w.r.t. A(da;), density which will be denoted by x by abuse of notation: 
x(da;) = x(a;)A(da;). Then, the density ttq-xit of the distribution Hq-t'ij' with respect to 
J|j^QA(da;t) writes 



T^O:T\TixO:T) OC x{xo)g{xo, yo) 



T 



Y\_m{xi-i,Xi)g{x^,yi 



.1=1 



(2) 



As noted in lGilks and Berzuinil (J2001r) . the smoothing density 7ro:T|T in © is known up 
to a normalizing constant so that approximation of this distribution can be perfectly cast 
into the general framework of the Metropolis-Hastings algorithm. Given that the resulting 
Markov chain evolves in the path space X^+^, the candidate at each iteration should be 
carefully chosen to keep the acceptance rate away from zero which is a delicate task in high 
dimensional spaces. Considering this, an appealing approach in the MCMC literature is the 
Gibbs sampler and more generally the Metropolis-within-Gibbs sampler which proposes to 
update only one component at a time. One could also choose to update components by 
blocks but as will be seen in Section 21 moving only one component at a time is sufficient 
for our purpose. A key point for exploring the posterior distribution within a reasonable 
number of iterations is that the algorithm should be well initialized at least for the first 
components to be updated. We propose here to achieve this by exploiting approximation 
of no:T|T provided by SMC-based algorithms. 

More precisely, suppose that we already have an approximation of 110:^1^ through a set 
of (normalized) weighted particle paths, {Cq-t ^ '^o-T)i^i ^^ the sense that 

TV TV 

Tlo.,T\T{h) ~^^o.!^h{Co.^) , ^wg:^ = 1 , (3) 

i=l z=l 



4 C. Dubarry et al. 

We intend here to improve this approximation by running N independent Metropohs-within- 
Gibbs Markov chains [Cq-t [k], k > Q) for i G {I, . . . , N} starting from each path Cq'-t' ^^^^ 
is, we set <^Q.y [0] = ^q.^, for i £ {!,..., N}. The resulting approximation after K iterations 
of the Markov chains then writes 

N 

nO:T|T(/i)«E^O:TMC:?[^])- (4) 

i=l 

Let us now detail the transition of (Cd-TWi^ k >0). For a simpler exposition, we drop here 
the dependence on i, N. Now, consider a family of transition kernel densities {rt)o<t<T such 
that ro, rT are transition kernel densities on (X, X) whereas for t G {1, . . . , T — 1}, rt is a 
transition kernel density on (X x X, <Y). For u, v,w,x G X, set 

, . del xix)g{x,yo)m{x,w) ro{w;v) 

ao[v,w;x) = -— -Al, (5) 

X(v)9[v,yo)m{v,w) ro{w]x) 

dcf m{u,x)g{x,yt)m{x,w)rt{u,w;v) i ^ . ^ rp -, ^f,^ 

at{u,v,w;x) = — — -— ^ -Al, l<t<T ~1 , (6) 

m[u,v)g{v,yt)m[v,w) rt[u,w;x) 

( - dcf m{u,x)g{x,yT) rrju^v) , . 

aT{u,v;x) = — — -Al. (7) 

m[u,v)g{v,yt) rT[u;x) 

At time fc, the new path ^o:t[^] is obtained by updating backward in time each component 
^t[k] as follows 

(i) Sample a candidate X '^ rt{^t-i[k — l],^t+i[A:], •), 

(ii) Accept ^t[k] == X with probability at{£,t-i:t[k - l],^t+i[k];X), 

(iii) Otherwise, set ^t[k] ~ ^t[k ~ 1]. 

This procedure is valid for t e {1, . . . , T— 1}; we skip the description of the updates for (,o[k] 
and £,T[k] since they follow the same lines under very slight modifications. The complete 
pseudo-code version of the Metropolis-Hastings Improved Particle Smoother (MH-IPS) is 
given below. 

Straightforwardly, for any t G {0, . . . ,T}, at is the classical Metropolis-Hastings accep- 
tance rate associated to the proposal kernel rt and the target distribution IIq.t\t- Due 
to the specific structure of IIo:T\t whose density is a product of quantities involving con- 
secutive components, the acceptance ratios in (O, (El) and ([7]) do not depend on the path 
space dimension and are therefore nondegenerated. Of course, it is also possible to up- 
date each component from an arbitrary number of neighbors. Nevertheless, in the Gibbs 
Sampler for which all the acceptance rates are equal to one, the t-th component is updated 
according to the distribution of Xt conditionally on Xg-t-i, Xt+i-.x, Yq-t which only depends 
on Xt-i,Xt+i,Yt. Such dependence suggests that the candidate in the Metropolis- within- 
Gibbs algorithm should be proposed according to a distribution which only involves its 
nearest neighbors. 

MH-IPS is based on a first approximation of Hq.tit given in Q whereas some SMC 
algorithms like the Filter-Smoother are known to suffer from a poor representation of the 
states close to but are accurate for states close to T. As a consequence, (^(' )^i for large 
values of t are well-distributed and this set of particles is then propagated to the poorer ones 
by updating the components backward in time. In other words, instead of a random-scan 
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Algorithm 1 MH-IPS 

1: Initialization 

2: Run an SMC-algorithm targeting H^.tit and store (^o'7'''^oT)i^i- 

3: Set: \/l<i<N, Co.!^[0] = Cq.^^. 

4: K improvement passes 

5: for k from 1 to K do 

6: for i from 1 to A'^ do 

7: Sample X-rT(C^!!i[fc- I];-), 

8: Accept ^y [fc] = A with probability ^^(Ct-itI^ ~ 1]'"'^)' 

9: Otherwise, set ^j, [k] = S,^ [k — 1]. 

10: for t from T — 1 down to 1 do 

11: Se.mp\eX^niC-,[k~llC+i[k];^ 

12: Accept ^j' [fc] = A with probability Q;t(^jLj.Jfc — l],^J:,^j[fc], A), 

13: Otherwise, set ^j'^ [k] = ^j-^ [fc - 1] . 

14: end for 

15: Sample A- ro(Ci'^[fc];-), 

16: Accept ^Q [k] = X with probability ao(Co [^ ~ l]i?i' [^]i^)j 

17: Otherwise, set ^o^[k] = ^*'^[A: - 1]. 

18: end for 

19: end for 



procedure where components are updated at random, this determistic-scan Metropolis- 
Hastings algorithm extends the diversity of the particle paths to the lower values of t at 
each backward pass. The fact that MH-IPS uses the SMC-based approximation just once 
and then, keep the N Metropolis-within-Gibbs Markov chains independent from each other 
implies that the path degeneracy vanishes as the number of iterations increases. Strong 
empirical evidences of this phenomenon are provided in Section 21 

A last but striking particularity of MH-IPS when compared to classical MH algorithms 
is the fact that the approximation ^ only involves the states at iteration K of the N 
Markov chains instead of using all the history of these Markov chains. Indeed, since only 
one component is updated at a time, the consecutive paths are highly positively correlated 
so that including them into (HI is detrimental to the quality of the approximation. Another 
advantage of considering only states at iteration K is that the CLT of the approximation ^ 
which is quite easy to establish when A oc In A includes a very simple and close-to-optimal 
expression of the asymptotic variance. The estimation of this variance can be performed 
using the evolution of only one population of sample paths. Therefore, on the contrary to 
all the smoothing algorithms proposed in the literature so far, confidence intervals can be 
obtained without additional Monte Carlo passes. 



3. Properties of the algorithm 

In this section, since the number of observations is fixed, T is dropped for simplicity from 
the notation. For example, we set H = IIq.t\T; C' = ^otit' '^*'^ = ^o-t\t ^^'^ ^° '^'^• 

The general procedure induced by MH-IPS can be described as follows. Let Q be a 
Markov transition kernel on (X'^+^, A'®'^-^+^^) with invariant distribution H. Consider a 
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set of normalized weighted particles (^'' 7<^''^)iLi ^nd move the particles independently 
according to the kernel Q. To be specific, define N independent Markov chains (^*' [k],k > 
0)^1 such that: 






r^^[o]=l 



i,N 



According to Q. Ilh is approximated after k iterations of the Markov chains by: 

N N 



uh^Yl ^'•^HC'^ik]), Yl ""^'^ = ^ 



(8) 
(9) 



(10) 



i=l 



3.1. A resampling step in tine initialization 

Let us first consider the impact of the weights on the quality of the approximation. A resam- 
pling step in the initialization consists in replacing the weighted particles (^*' , u}^'^)fLi by 

the unweighted particles (^ ' , 1/A^)^i such that some unbiasedness condit ion is fulfilled. 
Where a s many resampli n g strategies have b een d eveloped i n the literat u re (ILiu and Chen 
(|1998[) . iKitagawal (|l998l ). Icarpenter et all (|l999[ ): see also IDouc et al.l (l2005|) for a brief 



review of their different properties), we only focus here on the most simple one, the multi- 
nomial resampling: 



(i) (€ ' )?=! ^''6 independent conditionally on (^* 



N , ,i,N\N 



(ii) for all i,j e {1, . . 



,N},¥ ^' 



fi,N 



UJ 



i,N 



A straightforward calculation yields: 



N 



N 



Y&T[Y,^"-^Hi 



i.N\ 



< Var 



E'^crw 



\i=l 



\i=l 



showing that at time 0, the particle system with equal weights is less efficient than the 
one with original weights. Despite this, the resampling stage discards particles with small 
weights and duplicates "informative" particles (with high weights). As in the particle fil- 
tering theory, our hope is that the resampling stage increases the number of Markov chains 
starting from interesting regions with respect to the target distribution. 

Denote by ||-||rpy the total variation norm: ||/i||rpy = supiii 
sup^gx 1/(2^)1 and assume that 



|^(/)| where |/|c 



dcf 



(Al) For any x £ X^+\ limfc^oo \\Q''{x, ■) - Hi 



TV 



0. 



Under this assumption, it is straightforward that for any bounded measurable function h, 



E 



N 



,N 



Hi 



i,Nr 



is asymptotically unbiased whatever the weights are, provided their 



sum is equal to one. To go further, consider the effect of the weights on the second order 
approximation. The following proposition shows that as the iterations of the Markov chains 
goes to infinity, the quadratic error tends to a limit which is minimal when all the weights are 
equal to 1/7V. This advocates for a particle system with equal weights in the initialization 
as provided by a resampling step before letting evolve the N Markov chains. 



Particle approximation improvement of the joint smoottiing distribution witli on-ttie-fly variance estimation 

Proposition 1. Assume f-^ Jl]) . Then, for any bounded measurable function h, 



lim E 

A;— ^oo 



/ N 



J,Ni 



HC [k])-Ilh 



Yainih) E 



■ N 

E 



.,^'N\ 



where Varn (h) = Hh^ — (II/i) . Moreover, the previous limit is minimized when all the 
weights are equal: llj"^-^ = 1/N for all i G {1, . . . , N}. 

Proof. Proof is given the Appendix. D 

As a consequence of this proposition, it is assumed in the sequel that the multinomial 
resampling stage has been performed in the initiahzation, i.e. ([5]), ^ and (|10p are replaced 
by 



~i.N, , 

^ [0] 
I [k- 



Uh 



N 

E 

?=i 



~i,N 

~i N 

hit^'m/N, 



Then, according to Proposition [1] 



lim E 

k—>-oo 



N 



Y^hCt [k])/N^Uh 



\i=l 



Varn (h) /N 



(11) 
(12) 

(13) 



(14) 



Thus, when A'^ is fixed and k goes to infinity, (|14p shows that the approximation cannot be 
better than having N independent draws from the distribution 11. A natural question is 
now to properly tune the number of iterations k of the Markov chains to the number A^ of 
initial points so that the unweighted particles (^*' [fc], l/N)fLi have properties close to iid 
draws according to IT without letting k go to infinity. Before treating this question, let us 
examine some non-asymptotic result with respect to the approximation. 



3.2. Deviation Inequality 

Noting that (^ [k])fLi are i.i.d conditionally to J^^ === cr -^ ^ , J G {1, • ■ • , A^} > and that 



E 



h{t''[k])\fo'']=Q''hCt'' 



the conditional Hoeffding inequality directly yields: 
Proposition 2. For any bounded measurable function h, any fc G N and any e > 0, 

N 



Y^hCt [k])/N^nh 



> e 



< 2 cxp - 



Ne-" 



2{osc{h)Y 



N 



J2Q''hit'')/N-Ilh 



>e/2 



(15) 



where osc (h) = sup„ j,gx \h{u) — h{v)\. 

Nevertheless, when reading the inequality in Proposition[2j the question of knowing whether 
MH-IPS improves or does not improve the approximation is far from being obvious. We 
now answer this question in terms of the Central Limit Theorem. 
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3.3. Central limit theorem 

MH-IPS is based on a first approximation of Hh by a family of normalized weighted par- 



ticles (^*' ,'^^'^)iLi- For various versions of SMC methods, the asymptotic normality 

,i,N\N 



^i,N 



of (£''" ,a;''^^)iLi have already be e n obtain e d und e r different tec hni ques (see for exam- 
ple IPel Moral and Guionnetl (|l999t) . iKiinschI (|2000f ). IChoph] (|2004l) or IDouc and MoulinesI 
(2003))- The following proposition now focus on the effect of the multinomial resampling on 



the central limit theorem: whatever SMC method is chosen, if (£*' ,uj^' )iLi are asymp- 



,~i.N 



totically normal, then (^ , l/N)fLi are also asymptotically normal with Varn (h) as an 
additional term in the variance. 



..N 



Proposition 3. Assume that {C'" i'^^'^)iLi o.^^ asymptotically norm,al, in the sense 
that for any bounded measurable function h, there exists < (J^{h) < oo such that 



Ari/2 



TV 



Y^uj'^^h{e^)-nh 



.i=l 



V 



M{0,(J^{h)) 



Then, for any bounded measurable function h, 



7V1/2 



N 



,~J.N 



Y^hCt )/N-nh 



■JV{0,Yarn{h) + cr^{h)) 



The proof follows closely the lines of ( Chopinl . |2004 Theorem 1) or ( Douc and Moulines . 
20081 Theorem 4) and is omitted for the sake of brevity. 

Proposition [3] shows the asymptotic normality of (| ' [k],l/N)^^ for fc = 0. The 
Markov chains are then run independently according to the transition kernel Q and we now 
consider the impact on the approximation given in p^ for k = k^. To be specific, the 
following theorem shows that under the assumption that the kernel Q is T^-geometrically 
ergodic, for fc^r oc InA^, the unweighted particles (^ ' [ki\[],l/N)^^ are asymptotically 
normal with a reduced asymptotic variance. Define the following set of assumptions: 

(A2) There exists a measurable function V : X^+^ -^ [1, c«) such that 
(i) nV^ < oo and for any x £X and any fc G N, Q''V{x) < oo , 

(ii) there exists f3 G (0,1) such that for any /i £ Cy = {h; \h/V\oo < oo} and any 
a; e X, 

\Q''h{x) - Uh\ < l3''V{x) , 

(iii) the sequence {N~^ J2i=i ^^(^ ' )}a'>i of random variables is bounded in prob- 
ability. 

(.Algll-Q ensures that the quantities appearing in (A[2|)-(Inl) are well defined. (AlS])-® shows 
that Q is ^/-geometrically ergodic. (Al2|)- (pli| is a weak assumption concerning the initial 

unweighted particles (^ ' , l/N)fLi. If for example, (^ ' , l/N)fLi is consistent with respect 



,~J,N, 



to the function V'^ in the sense that X]i=i^^(€ ' )/^ converges in probability to HV^, 
then (A[2|)- (pli| holds. Condition under which suc h convergence resul t s hold for possibly 
unbounded functions may be found for example in IDouc and MoulinesI ( 20081 ) . 
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Theorem 1. Assume (J^^. Let {kj^)N>o be a sequence of integers such that 

lim kN + lniV/(21n/3) = oo . (16) 

N—yoo 

Then, for any h such that h^ G Cy , the following central limit theorem holds: 

N 

7V-i/2^ [Mr'^M) -n/i] A A^(0,Varn(/i)) . 

Proof. Proof is given in the Appendix. D 

Theorem [T] and Proposition [3] show that fcjv iterations of the Markov chains reduce the 
asymptotic variance when compared to a sample obtained by multinomial resampling of a 
population issued from any SMC method. The asymptotic variance Varn (h) in Theorem [1] 
is close to optimal since it is the same as for i.i.d. draws with distribution 11. Moreover, the 
expression of a^ (h) in Proposition [3] is usually quite involved and for obtaining confidence 
intervals, the estimation of the asymptotic variance in Proposition [3] is classically obtained 
by adding some Monte Carlo passes. This is not at all the case in Theorem [1] since esti- 

~i N 

mation of Varn (h) can be performed directly via (£ ' [/cat], l/N)fLi. Finally, by adding 
typically fcjv = — In N/ In (3 iterations of a transition kernel to a SMC-based population of 
particles, we obtain a sample with a reduced and close-to-optimal variance which can be 
easily approximated without additional simulations. 

The fact that the CLT holds for kN oc In A^ suggests that a good approximation of the 
target distribution may be achieved with only a few number of iterations of the parallel 
Markov chains. This will be confirmed empirically in the next section. 



4. Experiments 

The Filter- smoother is known to be quite easy to implement and efficient in terms of CPU 
time, but suffers dramatically from the degeneracy of the ancestors. We now see how 
only a few iterations of MH-IPS reduce the degeneracy and turn the Filter- smoother to 
a strong competitor to the existing smoother algorithms. In the sequel, denote by the 
Metropolis-Hastings Improved Filter- Smoother (MH-IFS), Algorithm [T] initialized with the 
Filter-Smoother. The performance of this algorithm is now compared to the other linear- 
in-A^ particle smoothers (Filter-Smoother, FFBSi, Two-Filter). In order to be as compu- 
tationally fair as possible, all these algorithms are implemented in the same way as their 
common base, the Forward-Filter. 



4. 1 . Linear Gaussian Model 

We first consider the LGM defined by: 

Xt+i = (j)Xt -\- GuUt , Yt=Xt+ a-aVt , 

where Xq ^ A/" ( 0, j^r^- ) , {t^tjoi and {Vt}^^^ are independent sequences of i.i.d. standard 
gaussian random variables (independent of Ai). T + 1 = 101 observations were generated 
using the model with (/) = 0.9, cr„ = 0.6 and ct„ = 1. Furthermore, in this model, the 
fully-adapted filters are explicitly computable when needed and the Gibbs sampler may be 
implemented. 
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The diversity of the particle population at each time step for ea ch algorithm is m e asure d 
by an estimate of the effective sample size N^^°{t) as defined in iFearnhead et alJ ( 20101) . 



Motivated by the fact that E [Xn - /i)^ /c^^ = 1/^^, when X^'^\ . . . , X^^^ are i.i.d. with 
E[X'^^)] = /i, Var(X(^^) = (7^ and X^ is their sample mean, we set 



N^^^°{t)''=E 



a\s.o.N 



TT I ° ' (Id) - fit 



C^t 



(17) 



where Id is the identity function on R, fit and at are the exact mean and variance of Xt 
conditionally to Yq-t obtained from the Kalman smoother. In some sense, the weighted 
sample produced by a given algorithm is as accurate at estimating Xt as an "independent" 
sample of size N^f^° (t) . The expression of N^f^° (t) given in (fTT)) shows that it is inversely 
proportional to the quadratic error associated to a normalized estimator of E(Xt|Fo:T)- 
To estimate the expectation in (|17p we use the mean value from 250 repetitions of each 
algorithm with a number of particles chosen such that the computation time of each of 
them is the same. 

Figure [T]a shows that when the number of improvements increases, the degeneracy of 
the particle population for small values of t decreases and for K = 8 all the time steps have 
the same diversity. 

Figure [T]b displays the effective sample size of the four linear smoothing algorithms. 
As expected, the Filter-Smoother is highly degenerated for small values of t as opposed 
to the other algorithms. Furthermore, the MH-IFS clearly outperforms all others within a 
fixed computational time. In order to check that this efficiency is not due to the fact that 
the LGM allows to easily implement the Gibbs sampler, we now turn to a model where a 
rejection sampling is required. 



4.2. Stochastic Volatility Model 

StoVolM have been introduced in fina ncial time series modeli ng to capture more realistic 
features than ARCH/GARCH models ( Hull and White! ( 1987 )). Despite its apparent sim- 
plicity, the following equations do not allow to directly simulate according to rt{u,w] ■) oc 
miu,-)g{-,yt)'m{-,w): 

Xt+i = aXt + aUt+i , Yt = /Se^Vt , 

where Xq ^ Af [0, ,^ ^ ) , Ut and Vt are independent standard gaussian random variables, 
r + 1 = 101 observations were generated using the model with a — 0.3, a — 0.5 and 
(3 = 1 in order to estimate the effective sample size defined in (J17p . The true values of fit 
and (7t cannot be computed explicitly so they are estimated by running the MH-IFS with 
N = 650000. 



4.2.1. Gibbs sampler 

In the StoVolM, the Gibbs sampler requires to sample exactly from 



rt{u, w; x) oc exp ■ 



c 2 


1+^2 

2a2 


X — 


( " 


2/32 2^* 


1^1+^2 



{u + w) 



a2/2 



1 2' 



(18) 
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(b) Comparison of four linear smoothing algorithms 

Figure 1: Average effective sample size for each of the 100 time steps of the LGM using 
different smoothing algorithms for a fixed CPU time. 



for 1 < i < T — 1 (the cases i = and t = T are dealt with in a similar way) which does not 
correspond to a classical distribution. However, we propose here to implement a rejection 
sampling. The first idea is to sample the proposal candidate X ~ x according to the a priori 
distribution of Xt conditionally to Xt-i = u and Xt+i = w. The corresponding ratio of 
acceptance is then given by (|2/t|//3) exp { — (x — l)/2 — c^^y^/(2/3^)} and will obviously lead 
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to poor results for small values of yt- To counterbalance the effect of yt in the acceptance 
rate, the proposal distribution should also take the value of yt into account; we then rewrite 
(fTS]) for any 7* > (possibly depending on yt): 



rt{u,w] x) oc e 



VVt 



X exp ■ 









l + a2 



1 + ^2 



(19) 
which suggests to propose x according to M I -^ " ^ {u + w) — iJ^i (1 ~ 7t)i il^i ) and to 
accept it with a probability given by: 



\yt\ 



l/2o 



exp 



' 2 



(--i)-^y? 



(20) 



An optimal choice for 74 would consist in maximizing the smoothed expectation of (|20p but 
this quantity is intractable. An intuitive choice for 7^ is then: 



It 



{\vt\lfi)\ if|yt|</3, 

\yt\ll3, if|yt|>/3. 



(21) 



Indeed, for small values of yt, ((20|) is then close to one and for bigger values, the exponential 
becomes very small but the first term remains non- neglect able. 




Figure 2: Average effective sample size for each of the 100 time steps of the StoVolM using 
different smoothing algorithms for a fixed CPU time. 



The Improved Filter-Smoother used to generate Figure [5] performs simulations using the 
Gibbs sampler with the previous rejection sampling. We can see that this algorithm still 
leads to better results than the other ones within an equivalent computational time. 

In many instances (for example Expectation-Maximization algorithm, score computa- 
tion), it is necessary to estimate smoothed additive functionals such as IiQ.rp\j.[^H) where 
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for all xo:T e X +^, H{xo;t) = J2t=o^t- ^^ order to assess the smoothing algorithms on 
this matter, T + 1 = 1001 observations were generated. As seen before, the computational 
cost of the MH-IFS is linear in N which is verified by numerical experiments in Figure [S] 
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Figure 3: Average CPU time for computing a smoothed additive functional with the MH-IFS 
as a function of the number of particles. 

Figure [31a shows that the variance vanishes quickly with the number of improvement 
passes and only 4 iterations of the Markov chains are sufBcient to get an efhcient estimator. 
Then, the variances displayed in Figure |4lb allow again to draw the conclusion that for a 
fixed CPU time, the MH-IFS is more efficient than the Two-Filter. Finally, one improvement 
pass has been applied to the particle paths given by the FFBSi. The variance reduction is 
again significant as shown in Figure |31c. 



4.2.2. Metropolis-within-Gibbs and confidence interval 

In order to assess Algorithm [T] in the case where the Gibbs sampler could not be imple- 
mented, we now turn to the Metropolis-within-Gibbs sampler which is implemented by 
using again the proposal distribution; 



rt{u,w; •) ~ A/" 



l + a'^ 



(u + w) 



(7V2 

l + a2 



(l-7t), 



l + a2 



where jt is defined in (j2ip . and the associated acceptance rate is now given by: 



at(u, V, w; x) = exp 



It 



(x — v) 



2/32 



-Vt 



Al . 



Figure [5] compares the empirical variance of the Gibbs and Metropolis-within-Gibbs 
samplers of the smoothed additive functional conditionally to the T+1 = 1001 observations 
used previously. The efficiency of both algorithms is equivalent, showing that Algorithm [1] 
remains a great performer even when exact a posteriori simulation is not possible. 

Finally, Theorem [T] is assessed in Figure El The empirical variance of the estimator 
given by Algorithm [1] run with Kn oc In iV has been computed over 250 runs using the 
Gibbs and the Metropolis-within-Gibbs samplers for different number of particles TV and 
compared to the asymptotic variance Varn {h) /N estimated through only one population 
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(b) Variance of the Two-Filter Smoother and the Improved 
Filter-Smoother according to the CPU time 
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Figure 4: Variance of different smoothed additive functional particle estimators in the 
StoVolM. 




of particles. The results show that it is possible in practice to get a confidence interval for 
the approximation with only one run of Algorithm [1] of complexity 0{N In A^). 
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Figure 5: Variance of the Gibbs and Metropolis-within-Gibbs samplers according to the 
CPU time. 
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Figure 6: Algorithm [T] variance according to the number of observations. 



5. Conclusion 



At first sight, one could fear that the MH-IPS is too slow since the updates concern only 
one component at a time. The various comparisons performed for a fixed CPU time in the 
previous section show that this is not the case at all. Roughly speaking, a backward pass 
in the MH-IPS proposes to sequentially modify each component of the N parallel Markov 
chains. This can be seen as one run of N particles through T + 1 observations which is 
computationally equivalent to one pass of the bootstrap filter. By empirical evidences, we 
have seen that only a few backward passes {K = 4 or 8 in the examples) of the MH-IFS 
sweep out the degeneracy of the ancestors by extending backward in time the diversity of 
the particles. 

This method is linear in N and outperforms other existing algorithms as the FFBSi or 
the Two-Filter within a fixed CPU time. These performance results may be explained by 
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the fact that in the FFBSi algorithm, the points are sampled in the forward pass once and 
for all; the backward pass in the FFBSi only modifies the weights of the particles without 
moving them. On the contrary, the MH-IPS allows in the backward pass to move the 
particles and thus to explore interesting regions of the posterior distribution. In the Two- 
Filter sampler, two populations (the "forward" population and the "backward" population) 
evolve independently. At time t, a particle is sampled after choosing a couple of particles 
at time t — \ and t + \. The two components of these couples belong to independent 
populations and it is likely that even if their weights are respectively high, associating 
these independent particles could be detrimental to the approximation. On the contrary, 
in the MH-IPS, even if the Markov chains are independent, the proposed modification of 
the component is sampled with respect to its two neighbors which both belong to the same 
Markov chain. Note that we did not compare this algori thm to the Populati on Monte 
Carlo by Markov chains (PMCMC) samplers introduced bv lAndrieu et al.l ( 20101 ) since the 
framework here is not the Bayesian inference of parameterized Markov chains. 

Another major advantage here is the fact that a CLT can be obtained with a very 
simple asymptotic variance which can be estimated with only one run of the Algorithm 
and a complexity in 0{N\aN). This is totally new in comparison to all the smoothing 
algorithms proposed in the literature so far, where the asymptotic variances are usually 
particularly involved. Thus, for a fixed CPU time and only one run, this algorithm is able 
to produce both approximations of the smoothing distributions and confidence intervals. 

Finally, we only focus here on the MH-IFS since it is efficient enough for our purpose. Of 
course, many other variants with different SMC-based approximations in the initialization 
step may be performed. In the context of the paper, the MH-IPS only uses the SMC-based 
approximation once before starting independent MCMC Markov chains. The empirical 
performances of this algorithm, namely with respect to the diversity of the population and 
the precision of the approximation, seem to us convincing enough to let the Markov chains 
ev olve independently withou t trying to interact them again. Of course, as previously noted 



Gilks and Berzuinil ( 20011 ). in some different contexts, where for example, the observa- 



tions are available sequentially whereas approximations of the smoothing distributions are 
needed at each time, some variants with SMC steps mixed with MCMC steps can also be 
elaborated. Nevertheless, in the framework of this paper, the number T of the observations 
is fixed and we only focus here on how the independent MCMC steps drastically improve 
the first approximation obtained by SMC algorithms. In this context, there is no need to 
interact again the Markov chains; this allows to keep the diversity of the population while 
approximations and confidence intervals are obtained without effort. 



Appendix 

A. Proof of Proposition [H 

For all fc > 0, the bias plus variance decomposition writes 
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where Jo^ = cr||*'^,a;*^^, i £ {1, . . . , A^}|. Now, by definition of |*^^[fc], ie{l,...,N}, 
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and tfie first term of tlie RHS of ([22]) is bounded by 
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Tlie RHS goes to as fc tends to infinity by tlie Lebesgue convergence theorem since h is 
bounded. The same argument holds to handle the second term of the RHS of 
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Finally, conditionally to J-q , the random variables (^'' [k])iLi a-re independent and 
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This shows the first part of the proposition. Now, by the Cauchy-Schwartz inequality: 

1/2 



Af / N \ ^'^ 

i=l \4=1 / 



i.e. J2i=ii^'''^)'^ — 1/^ with equality only for o;''^ = 1/A^ for all i. The proof is completed. 
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B. Proof of Theorem [1] 

Let 7jv = fcjv +ln iV/(2 In /3). Under the assumptions of Theorem[Tl liniAr_>.oo 77V = 00. Now, 
write 



TV" 



i=i 



j=i 



-^/^E[Mr''[M)-Q'="Mr'')] . (23) 



Since V > I, {P^-^ implies that {N~^YJi=i'^{l'' )}n>i is bounded in probabihty. 
Combining this with 
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shows that the first term of the RHS of ([^5)) converges in probabihty to 0. Now. the second 
term of the RHS of ^ writes 



N- 



N N 



where 



~i,7V, 



-Fa.,, = a [t"" ^t" [kN]. {e,j) e {1, . . . ,*F} 



To apply (JDouc and Moulinesi l2008i Theorem A3) with Mn == N and ct^ = Varn (h), we 
need to check that 
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As /i^ G Cy, the first term of the RHS is upper-bounded by 

N 
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which converges in probability to 0. Now, note that the functions h^ and V are in Cy and 
\h\ < max(/i2,l) < max(/i2,T/) so that h G Cy- By applying |a^-6^| < |a- 6|^ + 2|&||a- &|, 
the second term of (p6|) is then upper-bounded by 

i=l i=l 

which again converges in probability to 0. This proves ([2^ . Now, let e > 0, 

N 



z2^\-^^,i'^{\UN,,\>e}\j^N,i-l\ 
1=1 

i=l 

N 

< n [hH[^2^,2^^] + /3^-" X 7V-1 ^ vCt'') , (27) 






where /i^l{;/i2>j2jv} e Cy. Since h'^ e Cy,(Al2])-(|I]) implies that II/i^ < cx). Then, the RHS 
of (j27p converges in probability to 0, showing (P5|) . The proof is completed. 
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